!ls
# might need to install lime in the docker or system I am using
#!pip install lime
#!pip install nltk
from __future__ import print_function  # __future__ imports must precede all other statements
import pandas as pd
import numpy as np
import scipy.stats as scs
import statsmodels.api as sm
import matplotlib.pyplot as plt
import lime
import sklearn
import sklearn.ensemble
import sklearn.metrics
from sklearn import feature_extraction
from sklearn.datasets import load_files
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
%matplotlib inline
%config InlineBackend.figure_format='retina'
df = pd.read_csv('small_descr_clm_code.csv')
df.drop('Unnamed: 0',axis=1, inplace=True)
df.head()
def remove_string(dataframe, column_list, string_in_quotes):
    '''
    Input:
        dataframe: name of pandas dataframe
        column_list: list of column name strings (ex. ['col_1','col_2'])
        string_in_quotes: string to remove in quotes (ex. ',')
    Output:
        none
        modifies the pandas dataframe in place to remove the string.
    Example:
        remove_string(df, ['col_1','col_2'], ',')
    Warning:
        If memory issues occur, limit to one column at a time.
    '''
    for col in column_list:
        # regex=False treats the string as a literal rather than a regex pattern
        dataframe[col] = dataframe[col].str.replace(string_in_quotes, "", regex=False).astype(str)
remove_string(df, ['descr'],',')
remove_string(df, ['clm'],',')
remove_string(df, ['descr'],'\n')
remove_string(df, ['clm'],'\n')
df.iloc[0]['clm']
Pandas introduced Categoricals in version 0.15. The category dtype uses integer codes under the hood to represent the values in a column, rather than the raw values, with a separate mapping from the integer codes back to the raw values. This arrangement is useful whenever a column contains a limited set of distinct values. When we convert a column to the category dtype, pandas uses the most space-efficient integer subtype that can represent all of the unique values in the column. [citation]
df['code'] = df['code'].astype('category')
df.info()
df.info(memory_usage='deep')
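As a quick sanity check on the claim above, here is a minimal sketch (toy data, not the notebook's CSV) comparing the deep memory usage of an object column against its category equivalent:

```python
import pandas as pd

# Toy column with only two distinct values, mirroring the 'code' column
toy = pd.DataFrame({'code': ['705', '706'] * 50000})

# Deep memory usage of the raw strings vs. the categorical encoding
as_object = toy['code'].memory_usage(deep=True)
as_category = toy['code'].astype('category').memory_usage(deep=True)

print(as_object, as_category)  # the category version is far smaller
```

With only two distinct values, the category codes fit in an int8, so the savings are roughly the per-string overhead multiplied by the row count.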
from sklearn.model_selection import train_test_split
X = df['descr']
y = df['code']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
Let's use the TF-IDF vectorizer, commonly used for text.

There are some important parameters to pass to the constructor of the class. The first is max_features, which is set to 1500. When you convert words to numbers using the bag-of-words approach, every unique word across all the documents becomes a feature, and the corpus can contain tens of thousands of unique words. Words with a very low frequency of occurrence are usually not good features for classifying documents, so we set max_features to 1500, meaning we keep only the 1500 most frequent words as features for training our classifier.

The next parameter is min_df, and it has been set to 5. This is the minimum number of documents that must contain a feature, so we only include words that occur in at least 5 documents. Similarly, max_df is set to 0.7, where the fraction corresponds to a percentage: we include only words that occur in at most 70% of all the documents. Words that occur in almost every document are usually not suitable for classification because they do not provide any unique information about a document.

Finally, we remove the stop words from our text, since stop words usually do not contain useful information for classification. To remove them we pass the stopword list from the nltk.corpus library to the stop_words parameter.

The fit_transform method of the TfidfVectorizer class converts the text documents into the corresponding numeric features. [citation]
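To make the min_df / max_df interaction concrete, here is a pure-Python toy sketch (made-up corpus and thresholds) of the document-frequency filter the vectorizer applies:

```python
from collections import Counter

# Made-up corpus; each document contributes each distinct term once
docs = [
    "claim denied by provider",
    "claim approved by provider",
    "provider filed claim",
    "patent application filed",
]
df_counts = Counter()
for doc in docs:
    df_counts.update(set(doc.split()))

n_docs = len(docs)
min_df, max_df = 2, 0.7  # absolute count and fraction, as in sklearn

# Keep only terms whose document frequency falls inside the allowed band
vocab = sorted(
    t for t, c in df_counts.items()
    if c >= min_df and c / n_docs <= max_df
)
print(vocab)
```

Note that sklearn interprets an integer min_df as an absolute document count and a float max_df as a fraction of the corpus, which is exactly the mixed usage in the constructor call below.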
stopwords removal
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False, max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
Now, let's say we want to use random forests for classification. It's usually hard to understand what random forests are doing, especially with many trees.
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)
rf.fit(train_vectors, y_train)
from sklearn.metrics import roc_curve, auc, f1_score, recall_score, precision_score
pred = rf.predict(test_vectors)
sklearn.metrics.f1_score(y_test, pred, average=None)
sklearn.metrics.f1_score(y_test, pred, average='weighted')
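Since we compute both per-class and weighted F1, here is a small sketch with made-up labels showing how the two averages relate:

```python
from sklearn.metrics import f1_score

# Hypothetical labels: class '705' has 3 true examples, '706' has 1
y_true = ['705', '705', '705', '706']
y_hat = ['705', '705', '706', '706']

# average=None returns one F1 score per class, in the order of `labels`
per_class = f1_score(y_true, y_hat, average=None, labels=['705', '706'])
# average='weighted' averages the per-class scores, weighted by class support
weighted = f1_score(y_true, y_hat, average='weighted')

print(per_class)
print(weighted)
```

The weighted average here is (3 × 0.8 + 1 × 0.667) / 4 ≈ 0.767, so the majority class dominates the summary number.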
LIME explainers assume that classifiers act on raw text, but sklearn classifiers act on a vectorized representation of the text. To bridge the two, we use sklearn's pipeline, which implements predict_proba on lists of raw text.
from lime import lime_text
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, rf)
print(c.predict_proba([X_test.iloc[0]]))
Now we create an explainer object. We pass class_names as an argument for prettier display.
from lime.lime_text import LimeTextExplainer
class_names = ['705','706']
explainer = LimeTextExplainer(class_names=class_names)
We then generate an explanation with at most 6 features for an arbitrary document in the test set.
idx = 83
exp = explainer.explain_instance(X_test.iloc[idx], c.predict_proba, num_features=6)
print('Document id: %d' % idx)
print('Probability(706) =', c.predict_proba([X_test.iloc[idx]])[0,1])
print('True class: %s' % y_test.iloc[idx])
print(y_test.iloc[idx])
print(X_test.iloc[idx])
exp.as_list()
From the lime docs: these weighted features are a linear model which approximates the behaviour of the random forest classifier in the vicinity of the test example. Roughly, if we remove 'transactions' and 'state' from the document, the prediction should move towards the opposite class by about the sum of the weights for both features. Let's see if this is the case. [citation]
print('Original prediction:', rf.predict_proba(test_vectors[idx])[0,1])
tmp = test_vectors[idx].copy()
tmp[0,vectorizer.vocabulary_['transactions']] = 0
tmp[0,vectorizer.vocabulary_['state']] = 0
print('Prediction removing some features:', rf.predict_proba(tmp)[0,1])
print('Difference:', rf.predict_proba(tmp)[0,1] - rf.predict_proba(test_vectors[idx])[0,1])
They probably were not worth much. I need to do more word removal, or work with n-grams.
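As a starting point for the n-gram idea, here is a hedged sketch (toy corpus) showing that only ngram_range needs to change for the vectorizer to emit bigram features alongside the unigrams:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; ngram_range=(1, 2) adds bigrams to the unigram features
toy_docs = ['claim denied by provider', 'claim approved by provider']
vec = TfidfVectorizer(ngram_range=(1, 2))
vec.fit(toy_docs)

vocab = sorted(vec.vocabulary_)
print(vocab)  # unigrams plus bigrams such as 'claim denied'
```

With bigrams, phrases like 'claim denied' become features in their own right, which LIME can then surface as single explanation terms.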
The explanations can be returned as a matplotlib barplot:
fig = exp.as_pyplot_figure()
The explanations can also be exported as an html page (which we can render here in this notebook), using D3.js to render graphs.
exp.show_in_notebook(text=False)
Alternatively, we can save the fully contained html page to a file:
exp.save_to_file('/tmp/oi_stopwords_removed.html')
Finally, we can also include a visualization of the original document, with the words from the explanation highlighted, so we can see where in the text the most influential words occur.
exp.show_in_notebook(text=True)
from sklearn.model_selection import train_test_split
X = df['clm']
y = df['code']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
Let's use the TF-IDF vectorizer, commonly used for text.
from sklearn import feature_extraction
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False, max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
Now, let's say we want to use random forests for classification. It's usually hard to understand what random forests are doing, especially with many trees.
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)
rf.fit(train_vectors, y_train)
from sklearn.metrics import roc_curve, auc, f1_score, recall_score, precision_score
pred = rf.predict(test_vectors)
sklearn.metrics.f1_score(y_test, pred, average=None)
LIME explainers assume that classifiers act on raw text, but sklearn classifiers act on a vectorized representation of the text. To bridge the two, we use sklearn's pipeline, which implements predict_proba on lists of raw text.
from lime import lime_text
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, rf)
print(c.predict_proba([X_test.iloc[0]]))
Now we create an explainer object. We pass class_names as an argument for prettier display.
from lime.lime_text import LimeTextExplainer
class_names = ['705','706']
explainer = LimeTextExplainer(class_names=class_names)
We then generate an explanation with at most 6 features for an arbitrary document in the test set.
idx = 83
exp = explainer.explain_instance(X_test.iloc[idx], c.predict_proba, num_features=6)
print('Document id: %d' % idx)
print('Probability(706) =', c.predict_proba([X_test.iloc[idx]])[0,1])
print('True class: %s' % y_test.iloc[idx])
print(y_test.iloc[idx])
print(X_test.iloc[idx])
exp.as_list()
From the lime docs: these weighted features are a linear model which approximates the behaviour of the random forest classifier in the vicinity of the test example. Roughly, if we remove 'provider' and 'state' from the document, the prediction should move towards the opposite class by about the sum of the weights for both features. Let's see if this is the case. [citation]
print('Original prediction:', rf.predict_proba(test_vectors[idx])[0,1])
tmp = test_vectors[idx].copy()
tmp[0,vectorizer.vocabulary_['provider']] = 0
tmp[0,vectorizer.vocabulary_['state']] = 0
print('Prediction removing some features:', rf.predict_proba(tmp)[0,1])
print('Difference:', rf.predict_proba(tmp)[0,1] - rf.predict_proba(test_vectors[idx])[0,1])
They probably were not worth much. I need to do more word removal, or work with n-grams.
The explanations can be returned as a matplotlib barplot:
fig = exp.as_pyplot_figure()
The explanations can also be exported as an html page (which we can render here in this notebook), using D3.js to render graphs.
exp.show_in_notebook(text=False)
Alternatively, we can save the fully contained html page to a file:
exp.save_to_file('/tmp/oi_claim_stopwords_removed.html')
Finally, we can also include a visualization of the original document, with the words from the explanation highlighted, so we can see where in the text the most influential words occur.
exp.show_in_notebook(text=True)